A commentary on

Assessment of insert sizes and adapter content in FASTQ data from NexteraXT libraries

by Turner, F. S. (2014). Front. Genet. 5:5. 
doi: 10.3389/fgene.2014.00005 

In a recent work, Turner (2014) compared the performance of two
bioinformatics programs, cutadapt (Marcel, 2011) and AlienTrimmer
(Criscuolo and Brisse, 2013), for trimming exogenous oligonucleotides
from short-insert paired-end reads. Turner (2014) suggested that
AlienTrimmer performed with very low sensitivity. Here we show that
this reported lack of performance was due to inappropriate use of
AlienTrimmer. Indeed, when all relevant oligonucleotide sequences to
be trimmed off are specified, AlienTrimmer performs faster than
cutadapt and with equally satisfactory sensitivity.

The bioinformatics protocol by Turner (2014) allows short-insert
paired-end reads, as well as the library preparation oligonucleotides
occurring within such reads, to be identified. These exogenous
oligonucleotide sequences need to be removed from such reads, as their
presence may negatively affect downstream analyses such as de novo
assembly or variant detection by mapping approaches (e.g., Criscuolo
and Brisse, 2013; Bolger et al., 2014). To illustrate this protocol,
Turner (2014) generated paired-end reads of length 250 base pairs
(bps) from Escherichia coli with 96 separate libraries prepared using
the standard dual index Nextera® XT transposon protocol. Figure 1A
shows an example of typical paired reads with short insert size from
these data. As underlined by Turner (2014), when the insert size is
very short (i.e., less than the length of a single read), each paired
read is a composite sequence: it starts at the 5' end with genomic
insert sequence, followed by the oligonucleotides used for library
preparation (i.e., reverse complement of concatenated primer + index +
transposase sequences), then by a short stretch of As, and finally by
apparently random sequence with low Phred (Ewing and Green, 1998)
quality scores Q (see Figure 1A). In order to trim off these exogenous
oligonucleotide sequences, Turner (2014) compared the respective
accuracy of cutadapt and AlienTrimmer. Yet both programs were given
only the transposase sequence as input to be trimmed off, and not the
other alien oligonucleotide sequences (see Figure 1A). With this
incomplete setting, the two trimming programs performed quite
differently. As cutadapt was set to perform 3' trimming only, it
detected the transposase sequence and removed it together with the
downstream sequences. In order to provide more flexible functionality
and simpler usage, AlienTrimmer's
strategy is to detect the specified alien oligonucleotides within
reads and to perform trimming when the matched region is close enough
to a read end. Therefore, when the adapter oligonucleotide (or
transposase) sequence is far from the 3' end because it is followed by
unspecified index, primer, or artefactual nucleotides, it is detected
by AlienTrimmer but not trimmed off (see Figure 1A). As a consequence,
AlienTrimmer did not yield accurate read trimming in the way Turner
(2014) used it (i.e., without specifying index, primer, and
artefactual sequences, although they were identified).
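
The behavior described above can be illustrated with a small toy model.
The sketch below is not AlienTrimmer's actual algorithm: it only mimics
the stated strategy (detect alien k-mers, trim only when the matched
region lies close to the 3' end). The function names, the max_gap
tolerance, and all sequences are hypothetical placeholders chosen for
illustration; they are not real Nextera XT oligonucleotides.

```python
# Toy illustration only: NOT AlienTrimmer's actual algorithm or parameters.
# All sequences below are made-up placeholders, not real Nextera XT oligos.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def kmers(seq, k):
    """Return the set of all overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_trim(read, aliens, k=10, max_gap=5):
    """Cut the read at the first alien-covered base, but only if the
    alien-covered region ends within max_gap bases of the 3' end.
    max_gap is an assumed tolerance, not an AlienTrimmer parameter."""
    alien_kmers = set()
    for alien in aliens:
        alien_kmers |= kmers(alien, k)
    # Mark every read position covered by a k-mer shared with an alien.
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in alien_kmers:
            covered[i:i + k] = [True] * k
    if True not in covered:
        return read
    first = covered.index(True)
    last = len(read) - 1 - covered[::-1].index(True)
    if len(read) - 1 - last <= max_gap:
        return read[:first]
    return read  # alien detected, but too far from the 3' end to trim


# Placeholder oligonucleotides (hypothetical).
primer = "TTGACCGTAAGCTAGC"
index = "CTAGTACG"
transposase = "GGTCTCAGAGTTCCAT"
insert = "ACGTTGCAAGGCTTAACCGGATCGATTACG"

# Alien as it appears in a short-insert read:
# reverse complement of primer + index + transposase, then poly-A.
full_alien = reverse_complement(primer + index + transposase) + "A" * 8
read = insert + full_alien

transposase_only = toy_trim(read, [reverse_complement(transposase)])
all_aliens = toy_trim(read, [full_alien])
print(transposase_only == read)   # transposase alone: read left untrimmed
print(all_aliens == insert)       # full alien: only the insert remains
```

In this toy setting, giving only the transposase leaves a long
unmatched tail between the detected region and the 3' end, so nothing
is trimmed; giving the full composite sequence lets the matched region
reach the read end.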

Here, we compared cutadapt (version 1.3) and AlienTrimmer (version
0.3.2) on the original read data (Turner, 2014) by providing all
oligonucleotide sequences to be trimmed off. Following the analyses of
Turner (2014), all paired reads with predicted insert size <250 bps
were gathered, and every read pair of combined length <250 bps after
quality trimming (cut-off Q = 20) was discarded, leading to a subset
of ~80,000 read pairs. The program cutadapt was launched with the two
reverse-complemented transposase sequences to perform 3' end trimming,
as it would run slower when specifying more oligonucleotide sequences,
which is unnecessary (see above). As AlienTrimmer running times are
not affected by the number of specified alien oligonucleotides to trim
off, it was launched with all combinations of exogenous bps (i.e.,
reverse complement of concatenated primer + index + transposase
sequences for each of the 16 used indexes + poly-A). The
quality-trimming option (cut-off Q = 20) was also set for both
programs in order to trim off the low-quality random bps at the 3' end
that sometimes occur when the insert size is very small (see Figure
1A). Following the criteria of Turner (2014), true positive and false
negative results were defined as trimmed reads leaving less and more
than five bps to be trimmed, respectively. Denoting TP and FN as the
number of true positive and false negative results, respectively, the
sensitivity TP / (TP + FN) was estimated for each number of bps to be
trimmed (Figure 1B).

FIGURE 1 | Paired reads containing exogenous oligonucleotide sequences
to be trimmed off, and performance results of two trimming programs.
(A) Example of a short-insert paired-end read. The Phred quality score
Q (up to 40) of each nucleotide is represented by a skyline plot. For
better reading, the thresholds Q = 0, 20, and 40 are represented by
dashed horizontal lines. RC, reverse complement. (B) Plot showing the
estimated sensitivity of AlienTrimmer (black circles) and cutadapt
(red circles) for each number of bases to be trimmed within
short-insert paired-end reads.
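
As a minimal sketch of the sensitivity estimate defined above, the
following function groups reads by the number of bps that should have
been trimmed and computes TP / (TP + FN) for each group. The record
layout and the function name are assumptions made for illustration.
The text does not say how a read leaving exactly five bps is
classified; this sketch counts it as a false negative.

```python
from collections import defaultdict


def sensitivity_by_length(records, threshold=5):
    """records: iterable of (bases_to_trim, bases_left_after_trimming).

    A read is a true positive when fewer than `threshold` alien bases
    remain after trimming, and a false negative otherwise (assumption:
    exactly `threshold` remaining bases counts as a false negative).
    Returns {bases_to_trim: TP / (TP + FN)}.
    """
    tp = defaultdict(int)
    fn = defaultdict(int)
    for bases_to_trim, bases_left in records:
        if bases_left < threshold:
            tp[bases_to_trim] += 1
        else:
            fn[bases_to_trim] += 1
    lengths = sorted(set(tp) | set(fn))
    return {n: tp[n] / (tp[n] + fn[n]) for n in lengths}


# Made-up example records, not data from the study.
example = [(10, 0), (10, 12), (50, 2), (50, 4)]
result = sensitivity_by_length(example)
print(result)
```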

All read pairs with insert size <250 bps were processed by cutadapt
and AlienTrimmer in ~17 and ~8 s, respectively, confirming that
AlienTrimmer runs fast with multiple input alien oligonucleotides.
Note that we ran AlienTrimmer compiled to native machine code with gcj
to obtain such fast running times: different Java virtual machines
were also tested to execute AlienTrimmer, each running from ~2 (Oracle
JRE 7) to ~6 (gij version 4.8) times slower than the gcj-compiled
version on these read data.
Clearly, AlienTrimmer had high sensitivity. As expected, AlienTrimmer
had moderate sensitivity only when the number of bps to be trimmed was
small (i.e., less than the specified k-mer value within a single read,
see Criscuolo and Brisse, 2013). Indeed, as AlienTrimmer was launched
with the default k-mer value (i.e., k = 10), moderate sensitivity was
observed when the number of bps to be trimmed was less than 20.
However, higher sensitivity values were observed by setting lower
k-mer values (not shown). Of note, the cutadapt plot in Figure 1B
shows a decreasing trend that differs from the increasing one in
Turner (2014). This could be explained by the fact that all paired
reads with insert size <250 bps were analyzed here, whereas Turner
(2014) considered selected subsets from each library.
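
The dependence on the k-mer value can be made concrete with simple
arithmetic: a stretch of n alien bases contains n - k + 1 overlapping
k-mers, and none at all when n < k. The helper below is a hypothetical
illustration of that count only; it does not reproduce AlienTrimmer's
decision rule.

```python
def alien_kmer_count(n_alien_bases, k=10):
    """Number of overlapping k-mers in a stretch of n alien bases."""
    return max(0, n_alien_bases - k + 1)


# With k = 10, very short alien stretches offer few or no k-mers to match.
counts = {n: alien_kmer_count(n) for n in (5, 9, 10, 19, 20)}
print(counts)
```

Lowering k increases the number of k-mers available in a short alien
stretch, which is consistent with the higher sensitivity reported for
lower k-mer values.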

In conclusion, AlienTrimmer has high sensitivity and speed in removing
alien oligonucleotide sequences from short-insert paired-end reads.
This work underlines that it is critical to specify all possible alien
oligonucleotide sequences as input for AlienTrimmer (as allowed by the
protocol presented by Turner, 2014) and to perform quality trimming
upstream of alien sequence removal. As the speed of AlienTrimmer is
not affected by the number of input oligonucleotide sequences, this
feature is a strong advantage for processing raw data from read
archives, for which the oligonucleotide sequences used for library
preparation may be undocumented.